Please submit your responses as a nicely formatted notebook (.ipynb) file.
Show your work for calculation questions (i.e., show the R code you used to answer the question).
If your answer includes many decimal places (e.g., 3.14159265358979), please round it to a reasonable number of decimals for readability (typically 3 to 4).
For ease of marking, avoid showing the outputs of unnecessary values.
Only the following packages are permitted:
tidyverse
All plots must be made with ggplot2 and include clear, descriptive axis titles rather than default column names.
Make sure your code runs without errors and shows all required outputs.
Question 1
Calculate \(\bar{x}\), \(\text{ss}\), \(s^2\), and \(s\) of the following data. You are not allowed to use the functions mean(), sd(), or var().
# Sum of Squared Deviationsss <-sum((x - m)^2)round(ss, 4)
[1] 4770.308
# Variancevar <- ss/(n -1)round(var, 4)
[1] 397.5256
# Standard Deviations <-sqrt(var)round(s, 4)
[1] 19.938
Question 2
Assuming the data from the previous question is normally distributed, calculate the points the middle 95% of the data fall within. Do not use the functions pnorm() or qnorm(); instead, use the values in the graph below (from the lecture) to determine the middle boundry.
Normal Ranges
# Lower Boundlower <- m -1.96* s# Upper Boundupper <- m +1.96* s
[1] "[627.1522, 705.30934]"
Tip
If you would like to display the output as an interval like I have done, you can just use paste0() function:
Assuming the data from question 1 is normally distributed, calculate the points the middle 68% of the data fall within. Do not use the functions pnorm() or qnorm(); instead, use the graph provided in the previous question.
# Lower Boundlower <- m -1* s# Upper Boundupper <- m +1* spaste0("[", round(lower, 5), ", ", round(upper, 5), "]")
[1] "[646.29272, 686.16881]"
Question 4
Use R code to calculate \(\mu\), \(\text{ss}\), \(\sigma^2\), and \(\sigma\) for the following numbers:
10, 10, 10, 10, 10, 10, 10, 10, 10, 10
Why do you think the answer turned out the way it did?
The data has no spread (i.e. does not vary/deviate); hence our statistical measures of spread (the sum of squared deviations, variance, and standard deviation) are just 0.
Important
If you used sd() or var() in your R code, you did this wrong. Those functions calculate the sample standard deviation and sample variance, respectively.
Question 5
Which of the following are measures of spread?
Range
IQR
Quartiles
Median
Mean
Sum of Squared Deviations
Variance
Standard Deviation
Z-score
Range, IQR, Sum of Squared Deviations, Variance, Standard Deviation.
Question 6
Using the Pearson height data from the lecture, transform the sons’ scores to have a mean of 500 and a standard deviation of 500. Then plot a histogram of the new distribution with an interesting fill and edge colour.
From the course website, download the dataset Pokemon.csv. This data set presents statistics for each of the pokémon characters in the popular video game franchise “Pokémon.”
Use this data set to plot a density plot for the variable “HP.” Give the density plot a fill colour of “Pikachu yellow”, which is defined by Nintendo as “#F6CF57.”
HP stands for hit points, or health points, and defines how much damage a pokémon can withstand before fainting.
Fun Fact
Some of the columns in this dataset contain Japanese characters. If you open this in Microsoft Excel, there is a good chance they won’t display properly. This is because, by default, Microsoft Excel does not read CSV files using UTF-8 character encoding despite that being the accepted standard across the world. Why? I don’t know - but it is probably because they want you to use .xlsx files and simply don’t care about anything that isn’t controlled by Microsoft. Note, if you want a spreadsheet software that will use UTF-8 character encoding by default, try Libre Office or even Google Sheets.
Plot a Q-Q plot of the pokémon HP column and indicate whether or not you believe the data is normally distributed (describe why you believe this - appeal to both the density and Q-Q plot).
Make sure the Q-Q-plot incorporates Pikachu yellow in some way.
The data is not acceptably normal. The density plot shows a long tail in the positive direction which indicates the presense of values that are likely to pull the mean and inflate the standard deviation. The Q-Q plot shows large portions of the ends of the distribution not aligning diagonally like you would expect normally distributed data to.
Note for grading
It’s also fine to label the x and y axes as “Expected Quantiles (or Values)” and “Observed Quantiles (or Values)” respectively. The y-axis could also be labelled as its original units (e.g., “HP Values”).
Question 9
Use Tukey’s Ladder of Powers to find a transformation that makes the pokémon HP data more normal/Gaussian. Specify which value of lambda (\(\lambda\)) you used and plot a histogram of the transformed data. Ensure that your x-axis label communicates the transformation you applied.
Note: the function transformTukey() is NOT allowed to be used for this question. You need to experiment and walk up/down the ladder yourself. (You do not need to show all your attempts, just show the final result). An exhaustive search for the exact value of lambda is unnecessary. Use only the values provided in the lecture table.
The square root was used (a.k.a \(\lambda = \frac{1}{2}\)).
Note
The cubic root is also an acceptable transformation (a.k.a \(\lambda = \frac{1}{3}\)).
Note
Use of expression() is not necessary for full marks. Something like x = "HP^(1/2)" is acceptable as well.
Question 10
Is the pokémon HP data now acceptably normal?
# To evaluate this properly, you need check a Q-Q plot:ggplot(poke, aes(sample = hp_t)) +stat_qq() +stat_qq_line() +labs(x ="Theoretical Quantiles",y ="Sample Quantiles" )
While there is some improvement, the data is still not acceptably normal since, in both the histogram and Q-Q plot, you can see very extreme outlying values which are probably inflating the standard deviation and shifting the mean.
Note for grading
Evaluation of normality needs to include, at a minimum, a Q-Q plot.
Statistical tests for normality should not be used (given mark deductions).
It’s also fine to label the x and y axes as “Expected Quantiles (or Values)” and “Observed Quantiles (or Values)” respectively. The y-axis could also be labelled in its transformed units (e.g., x = "HP^(1/2)).
Question 11
For the variable HP, calculate the following statistics using the original (untransformed) pokémon data:
Mean
Sample Standard Deviation
You are not allowed to use the mean(), var(), or sd() functions.
# Standard Deviations <-sqrt(sum((poke$hp - m)^2) / (N -1))s
[1] 26.57601
Question 12
For the variable HP, calculate the following statistics using the original (untransformed) pokémon data:
20% Trimmed Mean
20% Winsorized Standard Deviation
You are not allowed to use the mean() function or any function from the WRS2 package.
Helpful R tools for this task
sum()
length() / nrow()
rep()
sort()
floor()
Indexing with []
# Sample SizeN <-length(poke$hp)# Sort datahp <-sort(poke$hp)# Amount of values to remove from each sidetrim <-floor(N *0.20)# Subset rows to keephp_trim <- hp[(trim +1) : (N - trim)]# 20% Trimmed meansum(hp_trim) /length(hp_trim)
[1] 66.09563
# Add winsorized values to trimmed datahp_win <-c( hp_trim,rep(min(hp_trim), trim), # repeating the min valuerep(max(hp_trim), trim) # repeating the max value)# 20% Winsorized Standard Deviationsd(hp_win)
[1] 14.44225
Question 13
Using Hampel’s identifier, list the outliers in the original (untransformed) pokémon HP values. Do not use the mad() function.
When listing the outliers, only display the following columns in this order: $name, $pokedex_number, and $hp.
Note
There are numerous different methods for selecting columns. Below is an elegant tidyverse method.
Use the R functions pnorm() and qnorm() to answer questions a - p.
Note
There are multiple correct ways to apply these functions for some of the questions.
a.
What percentile is a z-score of 0 at?
pnorm(0, mean =0, sd =1) *100
[1] 50
b.
What percentile is a z-score of -1.96 at?
pnorm(-1.96, mean =0, sd =1) *100
[1] 2.49979
c.
What percentile is a z-score of 1.96 at?
pnorm(1.96, mean =0, sd =1) *100
[1] 97.50021
d.
What probability does a z-score of 1.96 correspond to?
pnorm(1.96, mean =0, sd =1)
[1] 0.9750021
e.
What percentile is a score of 202 at for a normal distribution with a mean of 224 and standard deviation of 29?
pnorm(202, mean =224, sd =29) *100
[1] 22.40397
f.
What percentile is a score of 224 at for a normal distribution with a mean of 224 and standard deviation of 29?
pnorm(224, mean =224, sd =29) *100
[1] 50
g.
What score is at the 2.5th percentile for a normal distribution with a mean of 224 and standard deviation of 29?
qnorm(0.025, mean =224, sd =29)
[1] 167.161
h.
What score is at the 97.5th percentile for a normal distribution with a mean of 224 and standard deviation of 29?
qnorm(0.975, mean =224, sd =29)
[1] 280.839
i.
What z-score represents the 2.5th percentile?
qnorm(0.025, mean =0, sd =1)
[1] -1.959964
j.
z-score represents the top 2.5% of a normal distribution?
qnorm(0.025, mean =0, sd =1, lower.tail =FALSE)
[1] 1.959964
k.
What z-score is at the 97.5th percentile?
qnorm(0.975, mean =0, sd =1)
[1] 1.959964
l.
What score represents the top 2.5% of a normal distribution with a mean of 224 and standard deviation of 29?
qnorm(0.025, mean =224, sd =29, lower.tail =FALSE)
[1] 280.839
m.
What score represents the top 5% of a normal distribution with a mean of 224 and standard deviation of 29?
qnorm(0.05, mean =224, sd =29, lower.tail =FALSE)
[1] 271.7008
n.
What z-score represents the top 5% of a normal distribution?
qnorm(0.05, mean =0, sd =1, lower.tail =FALSE)
[1] 1.644854
o.
What score represents the top 5% of a normal distribution with a mean of 132 and standard deviation of 57?
qnorm(0.05, mean =132, sd =57, lower.tail =FALSE)
[1] 225.7567
p.
What score represents the bottom 9.3% of a normal distribution with a mean of 132 and standard deviation of 57?
qnorm(0.093, mean =132, sd =57)
[1] 56.61721
Question 15
Which distribution contains the largest spread/variability?
Histogram C
To discern the answer, you need to gauge the scale of the x-axis.
Question 16
The following plot is a standard normal distribution. What percentile is represented by the coloured (non-grey) region.
pnorm(-0.5, mean =0, sd =1) *100
[1] 30.85375
Question 17
The following plot is a standard normal distribution. What percentile is represented by the coloured (non-grey) region.
pnorm(-1, mean =0, sd =1) *100
[1] 15.86553
Question 18
The following distribution has a mean of 100 and standard deviation of 15. What percentile is represented by the coloured (non-grey) region.
pnorm(70, mean =100, sd =15) *100
[1] 2.275013
Question 19
The following distribution has a mean of 100 and standard deviation of 15. What percentage of the data is represented by the coloured (non-grey) region.
pnorm(120, mean =100, sd =15, lower.tail =FALSE) *100
[1] 9.121122
Question 20
Bobby’s brother constantly tells him he’s abnormal. To prove this, his brother obtained a sample of test scores from Bobby’s class (Bobby was included in this sample). His classmates’ scores were 70, 68, 75, 72, 76, 74, 69, 82, 73, and Bobby scored a 56. If “abnormal” is operationally defined as being at least 2 standard deviations away from the mean, is Bobby abnormal? Assume the data follows a normal distribution.
A z-score shows Bobby to be 2.29 standard deviations below the mean. Therefore, Bobby’s brother is (technically) correct. Bobby is abnormal.
Note
This could have also been solved by calculating an interval spanning \(\pm\) 2 standard deviations.
Question 21
From the previous question, what is the probability of a student scoring at Bobby’s mark or less?
bob_prob <-pnorm(bobby, mean = m, sd = s)bob_prob
[1] 0.01102511
Question 22
What is the probability of a student scoring higher than Bobby?
1- bob_prob
[1] 0.9889749
Question 23
We have sent out everyone in a large introductory course to check whether people use seat belts. Each student has been told to look at 100 cars and count the number of people wearing seat belts. The number found by any given student is considered that student’s score. The mean score for the class is 44, with a standard deviation of 7. A student who has done very little work all year has reported finding 62 seat belt users out of 100. Do we have reason to suspect that the student just made up a number rather than actually counting? Assume the data collected by the class follows a normal distribution.
p <-pnorm(62, mean =44, sd =7)p
[1] 0.994936
The student’s score of 62 is at, approximately, the 99th percentile, which is exceedingly rare given all the other data collected. It seems very likely the student fabricated their answer.
Other possible ways of solving this:
You could compute a z-score. A z-score that is greater than 1.96 (or 2) standard deviations from the mean would (according to common convention) indicate a statistically rare value and thus suggest the student made up the number.
z <- (62-44) /7z
[1] 2.571429
You could compute a interval encompassing the middle 95% of the data. If the student’s score of 62 falls outside of that range, it’s probably fabricated. i.e.,
From the previous question, what is the minimum and maximum number of seatbelts that could have been “observed” by this student without drawing suspicion? Assume anything in the top and bottom 2.5% of the distribution is suspicious. Round to the nearest seatbelt.
# Min (bottom 2.5%)round(qnorm(p =0.025, mean =44, sd =7))
[1] 30
# Max (top 2.5%)round(qnorm(p =0.975, mean =44, sd =7))
[1] 58
Question 25
A researcher counted the number of behavioural responses a sample of participants completed during a period of operant extinction (i.e., the participaints were no longer reinforced/rewarded to perform a task). The data was positively skewed but the researcher nonetheless analyzed the results in terms of means and standard deviations. They reported a mean of \(5.88\) and a standard deviation of \(8.06\). Why was this an incredibly poor way to analyze the results? Hint: consider where \(0\) responses might fall if this distribution was symmetric and unimodal.
resp <-pnorm(q =0, mean =5.88, sd =8.06)resp
[1] 0.2328392
Using means and standard deviations assumes the data is symmetric. Given the mean and standard deviation they report, approximately 23% of their data must be falling below 0. This implies that they observed participaints completing negative amounts of responding! That doesn’t make any sense.
Question 26
Recall that in Questions 2 and 3 you calculated the middle 95% and 68% boundaries for the following data:
In a normal distribution, how many standard deviations above and below the mean are the upper and lower boundaries of the middle 68% interval? Give an answer rounded to at least 4 decimal places.